Graphical Descriptive Statistics

Michael Luu, MPH
Biostatistics & Bioinformatics Research Center
Cedars Sinai Medical Center
August 13, 2022

Why do we need to visualize our data?

Data

x y
55.3846 97.1795
51.5385 96.0256
46.1538 94.4872
42.8205 91.4103
40.7692 88.3333
38.7179 84.8718
35.6410 79.8718
33.0769 77.5641
28.9744 74.4872
26.1538 71.4103
x y
58.21361 91.88189
58.19605 92.21499
58.71823 90.31053
57.27837 89.90761
58.08202 92.00815
57.48945 88.08529
28.08874 63.51079
28.08547 63.59020
28.08727 63.12328
27.57803 62.82104
x y
38.33776 92.47272
35.75187 94.11677
32.76722 88.51829
33.72961 88.62227
37.23825 83.72493
36.02720 82.04078
39.23928 79.26372
39.78452 82.26057
35.16603 84.15649
40.62212 78.54210
x y
55.99303 79.27726
50.03225 79.01307
51.28846 82.43594
51.17054 79.16529
44.37791 78.16463
45.01027 77.88086
48.55982 78.78837
42.14227 76.88063
41.02697 76.40959
34.57531 72.72484

Let’s begin by computing descriptive statistics

dataset    n    mean_x   sd_x    mean_y   sd_y
A          142  54.26    16.76   47.84    26.94
B          142  54.26    16.76   47.84    26.94
C          142  54.26    16.76   47.84    26.94
D          142  54.26    16.76   47.84    26.94


The counts (n), the means of x and y, and the standard deviations of x and y are identical across ALL four datasets (to the two decimal places shown)!
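
A summary like the one above can be computed with a few lines of code. The following is a minimal Python sketch, not the code used to build these slides. It uses only the first three rows of each preview table from the "Data" slide, so its output will not reproduce the values in the table, which are based on all 142 rows. The labels preview_1 to preview_4 and the function name summarize are hypothetical; the slides do not state which preview belongs to which of datasets A to D.

```python
from statistics import mean, stdev

# First three rows of each preview table from the "Data" slide.
# The labels preview_1..preview_4 are hypothetical: the slides do not
# say which preview belongs to which of datasets A-D.
previews = {
    "preview_1": [(55.3846, 97.1795), (51.5385, 96.0256), (46.1538, 94.4872)],
    "preview_2": [(58.21361, 91.88189), (58.19605, 92.21499), (58.71823, 90.31053)],
    "preview_3": [(38.33776, 92.47272), (35.75187, 94.11677), (32.76722, 88.51829)],
    "preview_4": [(55.99303, 79.27726), (50.03225, 79.01307), (51.28846, 82.43594)],
}


def summarize(points):
    """Return n, mean, and sample standard deviation of x and y."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "n": len(points),
        "mean_x": mean(xs),
        "sd_x": stdev(xs),
        "mean_y": mean(ys),
        "sd_y": stdev(ys),
    }


# One summary row per dataset, mirroring the columns of the slide's table.
summary = {name: summarize(points) for name, points in previews.items()}

print("dataset     n  mean_x   sd_x  mean_y   sd_y")
for name, s in summary.items():
    print(
        f"{name} {s['n']:3d} {s['mean_x']:7.2f} {s['sd_x']:6.2f} "
        f"{s['mean_y']:7.2f} {s['sd_y']:6.2f}"
    )
```

Applied to the full datasets, the same function would be expected to yield one row per dataset with the columns shown in the table.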

Can we conclude the datasets are similar or identical?

Not quite yet!

Let’s visualize the relationship of x and y

Although simple quantitative summaries are similar …

They can appear drastically different when visualized!
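
The same point can be shown on a very small scale. The sketch below uses toy data that is not taken from the slides: two datasets share the same x values and the same set of y values, so n, the means, and the standard deviations agree exactly, yet plotting them shows opposite relationships. The helper ascii_scatter is a hypothetical stand-in that draws a coarse text grid using only the standard library; in practice a plotting library would be used instead.

```python
from statistics import mean, stdev


def ascii_scatter(points, width=20, height=10):
    """Render (x, y) points as a coarse text grid; '*' marks an occupied cell."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Scale each coordinate onto the grid; "or 1" avoids division by zero
        # when all values on an axis are equal.
        col = round((x - x_min) / ((x_max - x_min) or 1) * (width - 1))
        row = round((y - y_min) / ((y_max - y_min) or 1) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip so larger y is drawn higher
    return "\n".join("".join(line) for line in grid)


# Toy data (an assumption for illustration, not the Datasaurus data).
x = [1, 2, 3, 4, 5]
y_rising = [1, 2, 3, 4, 5]
y_falling = [5, 4, 3, 2, 1]

# The summaries of y cannot tell the two datasets apart.
print(mean(y_rising) == mean(y_falling), stdev(y_rising) == stdev(y_falling))

# The plots can.
print(ascii_scatter(list(zip(x, y_rising)), width=5, height=5))
print("-----")
print(ascii_scatter(list(zip(x, y_falling)), width=5, height=5))
```

The first grid runs from the lower left to the upper right and the second from the upper left to the lower right, even though every summary statistic computed above is the same for both.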

Datasaurus Dozen